
restart workers after executing 10 tasks to mitigate memory leaks #840

Merged (5 commits) on Jul 23, 2024

Conversation

@Adam-D-Lewis (Contributor) commented Jul 2, 2024

Fixes #nebari-dev/nebari#2418

Description

This pull request:

  • restarts Celery workers after they execute 10 tasks (docs here). This mitigates the memory leaks that are occurring; a sketch of the underlying Celery setting is shown below.
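For reference, a minimal sketch of the Celery setting this change relies on; the app name and broker URL below are illustrative placeholders, not conda-store's actual configuration:

```python
from celery import Celery

# Illustrative app and broker; conda-store wires up its own Celery app.
app = Celery("tasks", broker="redis://localhost:6379/0")

# Replace each worker child process with a fresh one after it has
# executed 10 tasks, so any memory leaked by those tasks is reclaimed
# when the old process exits.
app.conf.worker_max_tasks_per_child = 10
```

The same behavior can also be requested on the command line with celery worker's --max-tasks-per-child flag.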

Pull request checklist

  • Did you test this change locally?
  • Did you update the documentation (if required)?
  • Did you add/update relevant tests for this change (if required)?

Additional information

How to test

@trallard edited from: #840 (comment)

To test, you can check out this branch and run conda-store standalone (https://conda.store/conda-store/how-tos/install-standalone), or run it in Docker containers (docker-compose up --build -d, as explained here). After startup, visit localhost:8080/conda-store/admin and rebuild the environment it ships with a few times. If you watch the Docker memory usage (docker stats), you'll see that memory does not continue to climb with each new build as it does without this option set.
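If a rough, repeatable measurement is helpful, here is a small sketch that samples the worker container's memory with docker stats; the container name is an assumption and may differ in your compose setup:

```python
import subprocess
import time

# Hypothetical container name; check `docker ps` for the actual name
# of the conda-store worker container in your compose setup.
CONTAINER = "conda-store-worker"


def sample_memory(samples: int = 20, interval_seconds: int = 30) -> None:
    """Print the container's memory usage every `interval_seconds`."""
    for _ in range(samples):
        result = subprocess.run(
            [
                "docker", "stats", "--no-stream",
                "--format", "{{.Name}}\t{{.MemUsage}}",
                CONTAINER,
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        print(result.stdout.strip())
        time.sleep(interval_seconds)


if __name__ == "__main__":
    sample_memory()
```

Triggering a few rebuilds while this runs should show roughly flat memory with this change and steadily climbing memory without it.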

netlify bot commented Jul 2, 2024

Deploy Preview for conda-store ready!

Latest commit: d0c6085
Latest deploy log: https://app.netlify.com/sites/conda-store/deploys/66a01029aea96c0009c5fb96
Deploy Preview: https://deploy-preview-840--conda-store.netlify.app

@Adam-D-Lewis (Contributor, Author) commented Jul 2, 2024

@dcmcand @peytondmurray @kcpevey @trallard Could I get a review on this?

@Adam-D-Lewis (Contributor, Author)

I'm happy to make this configurable via a traitlet as well if desired, but the worker restarts seem to happen quickly, so I can't see much downside to just applying it to all conda-store deployments.

@kcpevey (Contributor) commented Jul 3, 2024

I can't speak to the technical aspect of this, but I'm very glad to see it. I see the memory leakage on all my deployments. As far as I can tell, it leads to builds failing erroneously due to OOM, generally with errors that make no sense. All around, this is a big improvement for admin management and end-user UX.

@peytondmurray (Contributor) left a review comment

Huh, with this option it looks like celery will just automatically restart the worker after 10 tasks are completed. My one complaint with this is it doesn't strictly speaking fix nebari-dev/nebari#2418 as we are only treating a symptom not the memory usage itself.

But in any case from what I have read it seems like this is exactly the kind of use case the option was intended for, so I guess let's just use it. Is there an easy way to test this? If so it would be nice to ensure that it works.

@peytondmurray added labels on Jul 3, 2024: status: in progress 🏗, impact: high 🟥 (This issue affects most of the conda-store users or is a critical issue), area: user experience 👩🏻‍💻 (Items impacting the end-user experience), type: maintenance 🛠, needs: testing ✅
@Adam-D-Lewis (Contributor, Author) commented Jul 4, 2024

Huh, with this option it looks like celery will just automatically restart the worker after 10 tasks are completed. My one complaint with this is it doesn't strictly speaking fix nebari-dev/nebari#2418 as we are only treating a symptom not the memory usage itself.

I agree it does just treat the symptom without fixing the underlying issue. That said, I think this will work robustly.

But in any case from what I have read it seems like this is exactly the kind of use case the option was intended for, so I guess let's just use it. Is there an easy way to test this? If so it would be nice to ensure that it works.

To test, you can check out this branch and run conda-store standalone (https://conda.store/conda-store/how-tos/install-standalone), or run it in Docker containers (docker-compose up --build -d, as explained here). After startup, visit localhost:8080/conda-store/admin and rebuild the environment it ships with a few times. If you watch the Docker memory usage (docker stats), you'll see that memory does not continue to climb with each new build as it does without this option set.

@peytondmurray (Contributor)

To test, you can check out this branch and run conda-store standalone (https://conda.store/conda-store/how-tos/install-standalone), or run it in Docker containers (docker-compose up --build -d, as explained here). After startup, visit localhost:8080/conda-store/admin and rebuild the environment it ships with a few times. If you watch the Docker memory usage (docker stats), you'll see that memory does not continue to climb with each new build as it does without this option set.

Ahh, what I meant was is there a simple way of adding automated testing?

@trallard (Collaborator) commented Jul 4, 2024

Ahh, what I meant was is there a simple way of adding automated testing?

I am +1 with Peyton here. It is good that we can manually verify (once or so), but I would prefer having some sort of automated tests (akin to load tests) to ensure this is, in fact, a useful fix.
Those tests could later be used to also do some benchmarking/memory checks.
So while the fix itself seems fine, I would suggest holding off on merging until the tests have been added.

@Adam-D-Lewis (Contributor, Author) commented Jul 9, 2024

I've followed up on where the memory leak is occurring using memray. I'm new to memray, but the flamegraph seems to point to line 20 of list_conda_prefix_packages in /opt/conda-store-server/conda_store_server/_internal/action/add_conda_prefix_packages.py, which is the prefix_data.load() call here. You can see more of the call stack in the screenshot. I believe what memray is telling us is that the data loaded on that line stays in memory for some reason. It's not that this particular line causes the leak, but that it is the object that is sticking around.

I'm not sure what that code is doing or why it would cause a memory leak, but I'm recording it here for now. I'll try to find out more later, or work out a more targeted fix.

[Screenshot: memray flamegraph showing the allocation call stack]
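For context, a minimal sketch of how a flamegraph like this can be captured with memray's Tracker API; the wrapped function below is a stand-in for the build code, not the exact call that was profiled:

```python
from memray import Tracker


def suspect_operation() -> None:
    # Stand-in for the code under investigation, e.g. the build step
    # that ends up calling list_conda_prefix_packages().
    ...


# Record every allocation made while the block runs, then render the
# report with: memray flamegraph memray-profile.bin
with Tracker("memray-profile.bin"):
    suspect_operation()
```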

@peytondmurray (Contributor)

Thanks, this is actually enough to start narrowing this down to something we can properly troubleshoot. I still think it would be worth merging this feature, as there doesn't seem to be a real cost to it. I'll open another issue with the goal of actually eliminating the memory leak.

@trallard (Collaborator) commented Jul 9, 2024

While I can see why add_conda_prefix_packages.py could lead to memory growth (and potential issues if the json.load() and its in-memory representations are not handled accordingly), I cannot see a definite indication that this is indeed the culprit.
add_conda_prefix_packages.py definitely spends a lot of time in CPU, but that does not necessarily mean it is leading to memory leaks or bloat.

So we'd definitely need to do more profiling and memory analysis over time.
@Adam-D-Lewis, can you share a minimal reproducible example of how you got to the flame graph you shared?

As for merging, since this PR has been marked as needing tests, I think we should still hold off until we have a test in place.

@Adam-D-Lewis (Contributor, Author) commented Jul 9, 2024

I plan to write up what I did to generate the memray flamegraph soon.

As far as a test showing the leak goes, what I'm planning on doing is running a test which builds many conda envs and shows that the memory used by the conda-store worker Docker container continues to rise with each build.

We could maybe use the limit_memory decorator in pytest-memray to make the test fail if it uses too much memory, but this would only be possible if we use the pytest-celery setup, which I was having trouble with, so I prefer the Docker tests at the moment.
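For illustration, a sketch of what a pytest-memray guard could look like; the build_environment fixture and the memory threshold are assumptions, not part of an existing conda-store test suite:

```python
import pytest


@pytest.mark.limit_memory("200 MB")  # threshold is an arbitrary placeholder
def test_repeated_builds_do_not_accumulate_memory(build_environment):
    # `build_environment` is a hypothetical fixture that triggers a single
    # conda environment build in-process; a Docker-based test would instead
    # watch the worker container's memory across builds.
    for _ in range(10):
        build_environment()
```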

@Adam-D-Lewis Adam-D-Lewis marked this pull request as draft July 9, 2024 19:37
@Adam-D-Lewis (Contributor, Author)

@trallard More info about the flamegraph generation in the new issue Peyton opened - #848 (comment)

@peytondmurray added the needs: discussion 💬 (This item needs team-level discussion before scoping) label on Jul 23, 2024
@Adam-D-Lewis Adam-D-Lewis marked this pull request as ready for review July 23, 2024 20:18
@peytondmurray (Contributor) left a review comment

After a discussion with @Adam-D-Lewis and @trallard, it seems like this is potentially only observable in Docker. For now we will merge this without a test, as it is low risk, and come back to testing memory consumption later once we know more.

@peytondmurray peytondmurray merged commit 1df30a2 into conda-incubator:main Jul 23, 2024
26 of 27 checks passed